The -s (scale) parameter was set to 0 in the pipeline. This parameter is interpreted as log2(s), so the setting of 0 results in adapter trimming after a single base match. We noticed over-aggressive trimming of the ends of reads due to such a low threshold for matching. In this report, we initially look at the differences of several -s settings and then continue to look at the differences between a stting of 0 and 1.6.

Libraries

library(ggplot2)
library(ggpubr)
library(dplyr)
library(plotly)

Difference in total counts

A match of 1 and 2 bases looks similar, while higher settings result in a sharp dropoff.

trim_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/trimming_results.csv", header=TRUE, sep=",")

ggplot(trim_counts, aes(x=as.character(scale), y=clipped_end_reads)) +
  geom_bar(stat="identity") +
  xlab("-s parameter") + ylab("Clipped reads") +
  ggtitle("Effect of -s parameter on read clipping") +
  theme_pubr()

Read the counts data in

Look at the HTSeq resulting gene count data for the setting of 0 and 1.6.

zero_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/counts/0/CD4_100B2_HTSeq_counts.txt", row.names=1)
one_six_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/counts/1.6/CD4_100B2_HTSeq_counts.txt", row.names=1)

bound_counts <- zero_counts
names(bound_counts) <- "lower_thresh"
bound_counts$higher_thresh <- one_six_counts$V2
bound_counts$diff <- bound_counts$lower_thresh - bound_counts$higher_thresh
bound_counts$ratio_change <- bound_counts$diff/bound_counts$lower_thresh

Scatter plot

Observe no outliers when comparing the counts between the settings. Any changes are minor and in a consistent direction.

ggplotly(
ggplot(bound_counts, aes(x=log2(lower_thresh), y=log2(higher_thresh), label=rownames(bound_counts))) +
  geom_point(shape=1) +
  xlab("log2(-s 0 resulting counts)") + ylab("log2(-s 1.6 resulting counts)") +
  ggtitle("Difference in resulting counts\ndue to -s parameter") +
  theme_pubr()
)

Total difference in counts

Look at the total difference in HTSeq counts due to the -s parameter change

sum(bound_counts$lower_thresh)-sum(bound_counts$higher_thresh)
[1] 144020

Zero count genes

Number of genes lost completely

colSums(bound_counts[1:2]==0)
 lower_thresh higher_thresh 
         9305          9313 

Average change

Look at the average per-gene counts change, using the absolute differences

mean(abs(bound_counts$ratio_change))
[1] NaN

Gene gain

Which genes do we gain with the higher thresh?

gained_genes <-
  setdiff(
    rownames(bound_counts[which(bound_counts$lower_thresh==0),]),
    rownames(bound_counts[which(bound_counts$higher_thresh==0),])
  )

bound_counts[gained_genes,1:3]

Gene loss

Which genes do we lose with the higher thresh?

lost_genes <-
  setdiff(
    rownames(bound_counts[which(bound_counts$higher_thresh==0),]),
    rownames(bound_counts[which(bound_counts$lower_thresh==0),])
  )

bound_counts[lost_genes,1:3]

Highest diffrences

Look at the genes which show the largest differences in counts

bound_counts %>% rownames_to_column() %>% top_n(50, diff) %>% arrange(-diff)
LS0tCnRpdGxlOiAiVHJpbW1pbmcgcGFyYW1ldGVyIGNoYW5nZXMgfCBmYXN0cS1tY2YgLXMgZmxhZyIKc3VidGl0bGU6ICJMaW5kYSBDcm5pYyBJbnN0aXR1dGUgZm9yIERvd24gU3luZHJvbWUiCmF1dGhvcjogIktvaGwgS2lubmluZyIKZGF0ZTogImByIGZvcm1hdChTeXMudGltZSgpLCAnJWQgJUIgJVknKWAiCm91dHB1dDoKICBodG1sX25vdGVib29rOiAKICAgIGNvZGVfZm9sZGluZzogbm9uZQogICAgdGhlbWU6IGNvc21vCiAgICB0b2M6IHllcwogICAgdG9jX2Zsb2F0OgogICAgICBjb2xsYXBzZWQ6IHllcwplZGl0b3Jfb3B0aW9uczogCiAgY2h1bmtfb3V0cHV0X3R5cGU6IGlubGluZQotLS0KClRoZSBgLXNgIChzY2FsZSkgcGFyYW1ldGVyIHdhcyBzZXQgdG8gYDBgIGluIHRoZSBwaXBlbGluZS4gVGhpcyBwYXJhbWV0ZXIgaXMgaW50ZXJwcmV0ZWQgYXMgbG9nMihzKSwgc28gdGhlIHNldHRpbmcgb2YgYDBgIHJlc3VsdHMgaW4gYWRhcHRlciB0cmltbWluZyBhZnRlciBhIHNpbmdsZSBiYXNlIG1hdGNoLiAgV2Ugbm90aWNlZCBvdmVyLWFnZ3Jlc3NpdmUgdHJpbW1pbmcgb2YgdGhlIGVuZHMgb2YgcmVhZHMgZHVlIHRvIHN1Y2ggYSBsb3cgdGhyZXNob2xkIGZvciBtYXRjaGluZy4gSW4gdGhpcyByZXBvcnQsIHdlIGluaXRpYWxseSBsb29rIGF0IHRoZSBkaWZmZXJlbmNlcyBvZiBzZXZlcmFsIGAtc2Agc2V0dGluZ3MgYW5kIHRoZW4gY29udGludWUgdG8gbG9vayBhdCB0aGUgZGlmZmVyZW5jZXMgYmV0d2VlbiBhIHN0dGluZyBvZiBgMGAgYW5kIGAxLjZgLgoKIyMjIExpYnJhcmllcwoKYGBge3J9CmxpYnJhcnkoZ2dwbG90MikKbGlicmFyeShnZ3B1YnIpCmxpYnJhcnkoZHBseXIpCmxpYnJhcnkocGxvdGx5KQpgYGAKCiMjIyBEaWZmZXJlbmNlIGluIHRvdGFsIGNvdW50cyAKCkEgbWF0Y2ggb2YgMSBhbmQgMiBiYXNlcyBsb29rcyBzaW1pbGFyLCB3aGlsZSBoaWdoZXIgc2V0dGluZ3MgcmVzdWx0IGluIGEgc2hhcnAgZHJvcG9mZi4KYGBge3J9CnRyaW1fY291bnRzIDwtIHJlYWQudGFibGUoIn4vUHJvamVjdHMvQ0Q0X0NEOF9OS190cmltbWluZ19pbnZlc3RpZ2F0aW9ucy90cmltbWluZ19yZXN1bHRzLmNzdiIsIGhlYWRlcj1UUlVFLCBzZXA9IiwiKQoKZ2dwbG90KHRyaW1fY291bnRzLCBhZXMoeD1hcy5jaGFyYWN0ZXIoc2NhbGUpLCB5PWNsaXBwZWRfZW5kX3JlYWRzKSkgKwogIGdlb21fYmFyKHN0YXQ9ImlkZW50aXR5IikgKwogIHhsYWIoIi1zIHBhcmFtZXRlciIpICsgeWxhYigiQ2xpcHBlZCByZWFkcyIpICsKICBnZ3RpdGxlKCJFZmZlY3Qgb2YgLXMgcGFyYW1ldGVyIG9uIHJlYWQgY2xpcHBpbmciKSArCiAgdGhlbWVfcHVicigpCmBgYAoKIyMjIFJlYWQgdGhlIGNvdW50cyBkYXRhIGluCgpMb29rIGF0IHRoZSBIVFNlcSByZXN1bHRpbmcgZ2VuZSBjb3VudCBkYXRhIGZvciB0aGUgc2V0dGluZyBvZiBgMGAgYW5kIGAxLjZgLgpgYGB7cn0KemVyb19jb3VudHMgPC0gcmVhZC50YWJsZSgifi9Qcm9qZWN0cy9DRDRfQ0Q4X05LX3RyaW1taW5nX2ludmVzdGlnYXRpb25zL2NvdW50cy8wL0NENF8xMDBCMl9IVFNlcV9jb3VudHMudHh0Iiwgcm93Lm5hbWVzPTEpCm9uZV9zaXhfY291bnRzIDwtIHJlYWQudGFibGUoIn4vUHJvamVjdHMvQ0Q0X0NEOF9OS190cmltbWluZ19pbnZlc3RpZ2F0aW9ucy9jb3VudHMvMS42L0NENF8xMDBCMl9IVFNlcV9jb3VudHMudHh0Iiwgcm93Lm5hbWVzPTEpCgpib3VuZF9jb3VudHMgPC0gemVyb19jb3VudHMKbmFtZXMoYm91bmRfY291bnRzKSA8LSAibG93ZXJfdGhyZXNoIgpib3VuZF9jb3VudHMkaGlnaGVyX3RocmVzaCA8LSBvbmVfc2l4X2NvdW50cyRWMgpib3VuZF9jb3VudHMkZGlmZiA8LSBib3VuZF9jb3VudHMkbG93ZXJfdGhyZXNoIC0gYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2gKYm91bmRfY291bnRzJHJhdGlvX2NoYW5nZSA8LSBib3VuZF9jb3VudHMkZGlmZi9ib3VuZF9jb3VudHMkbG93ZXJfdGhyZXNoCmBgYAoKCgojIyMgU2NhdHRlciBwbG90CgpPYnNlcnZlIG5vIG91dGxpZXJzIHdoZW4gY29tcGFyaW5nIHRoZSBjb3VudHMgYmV0d2VlbiB0aGUgc2V0dGluZ3MuIEFueSBjaGFuZ2VzIGFyZSBtaW5vciBhbmQgaW4gYSBjb25zaXN0ZW50IGRpcmVjdGlvbi4KYGBge3J9CmdncGxvdGx5KApnZ3Bsb3QoYm91bmRfY291bnRzLCBhZXMoeD1sb2cyKGxvd2VyX3RocmVzaCksIHk9bG9nMihoaWdoZXJfdGhyZXNoKSwgbGFiZWw9cm93bmFtZXMoYm91bmRfY291bnRzKSkpICsKICBnZW9tX3BvaW50KHNoYXBlPTEpICsKICB4bGFiKCJsb2cyKC1zIDAgcmVzdWx0aW5nIGNvdW50cykiKSArIHlsYWIoImxvZzIoLXMgMS42IHJlc3VsdGluZyBjb3VudHMpIikgKwogIGdndGl0bGUoIkRpZmZlcmVuY2UgaW4gcmVzdWx0aW5nIGNvdW50c1xuZHVlIHRvIC1zIHBhcmFtZXRlciIpICsKICB0aGVtZV9wdWJyKCkKKQpgYGAKCgoKCiMjIyBUb3RhbCBkaWZmZXJlbmNlIGluIGNvdW50cwoKTG9vayBhdCB0aGUgdG90YWwgZGlmZmVyZW5jZSBpbiBIVFNlcSBjb3VudHMgZHVlIHRvIHRoZSBgLXNgIHBhcmFtZXRlciBjaGFuZ2UKYGBge3J9CnN1bShib3VuZF9jb3VudHMkbG93ZXJfdGhyZXNoKS1zdW0oYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2gpCmBgYAoKCiMjIyBaZXJvIGNvdW50IGdlbmVzCgpOdW1iZXIgb2YgZ2VuZXMgbG9zdCBjb21wbGV0ZWx5CmBgYHtyfQpjb2xTdW1zKGJvdW5kX2NvdW50c1sxOjJdPT0wKQpgYGAKCiMjIyBBdmVyYWdlIGNoYW5nZQoKTG9vayBhdCB0aGUgYXZlcmFnZSBwZXItZ2VuZSBjb3VudHMgY2hhbmdlLCB1c2luZyB0aGUgYWJzb2x1dGUgZGlmZmVyZW5jZXMKYGBge3J9Cm1lYW4oYWJzKGJvdW5kX2NvdW50cyRkaWZmKSkKbWVhbihhYnMoYm91bmRfY291bnRzJHJhdGlvX2NoYW5nZSkpCmBgYAoKCgojIyMgR2VuZSBnYWluCgpXaGljaCBnZW5lcyBkbyB3ZSBnYWluIHdpdGggdGhlIGhpZ2hlciB0aHJlc2g/CmBgYHtyfQpnYWluZWRfZ2VuZXMgPC0KICBzZXRkaWZmKAogICAgcm93bmFtZXMoYm91bmRfY291bnRzW3doaWNoKGJvdW5kX2NvdW50cyRsb3dlcl90aHJlc2g9PTApLF0pLAogICAgcm93bmFtZXMoYm91bmRfY291bnRzW3doaWNoKGJvdW5kX2NvdW50cyRoaWdoZXJfdGhyZXNoPT0wKSxdKQogICkKCmJvdW5kX2NvdW50c1tnYWluZWRfZ2VuZXMsMTozXQpgYGAKCgoKIyMjIEdlbmUgbG9zcwoKV2hpY2ggZ2VuZXMgZG8gd2UgbG9zZSB3aXRoIHRoZSBoaWdoZXIgdGhyZXNoPwpgYGB7cn0KbG9zdF9nZW5lcyA8LQogIHNldGRpZmYoCiAgICByb3duYW1lcyhib3VuZF9jb3VudHNbd2hpY2goYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2g9PTApLF0pLAogICAgcm93bmFtZXMoYm91bmRfY291bnRzW3doaWNoKGJvdW5kX2NvdW50cyRsb3dlcl90aHJlc2g9PTApLF0pCiAgKQoKYm91bmRfY291bnRzW2xvc3RfZ2VuZXMsMTozXQpgYGAKCgojIyMgSGlnaGVzdCBkaWZmcmVuY2VzCgpMb29rIGF0IHRoZSBnZW5lcyB3aGljaCBzaG93IHRoZSBsYXJnZXN0IGRpZmZlcmVuY2VzIGluIGNvdW50cwpgYGB7cn0KYm91bmRfY291bnRzICU+JSByb3duYW1lc190b19jb2x1bW4oKSAlPiUgdG9wX24oNTAsIGRpZmYpICU+JSBhcnJhbmdlKC1kaWZmKQpgYGAKCg==